Research on microbial communities could impact a vast number of applications in "bio"-related
disciplines. Large-scale analyses have become a clear trend in microbial community studies, so it is
increasingly important to mine large numbers of samples efficiently and in depth for insightful
biological principles. However, as microbial communities come from different sources and have
different structures, comparison and data-mining across large numbers of samples become quite
difficult. In this work, we propose a data model to represent large-scale comparison of microbial
community samples, namely the "Multi-Dimensional View" data model (the MDV model), which should
include at least 3 aspects: samples profile (S), taxa profile (T) and meta-data profile (V). We also
propose a method for rapid data analysis based on the MDV model and apply it to case studies with
samples from various environmental conditions. Results show that although sampling environments
usually define the key variables, the analysis could detect bio-markers and even subtle variables
based on large numbers of samples, which might be used to discover novel principles that drive the
development of communities. The results validate the efficiency and effectiveness of the data
analysis method based on the MDV model.

Microbes are ubiquitous on our planet, and it is well known that the total number of microbial cells on
Earth is huge. These organisms usually live in communities, and each of these communities has a
different taxonomical structure. As such, microbial communities serve as the largest reservoir of
genes and genetic functions for a vast number of applications in "bio"-related disciplines, including
biomedicine, bioenergy, bioremediation, and biodefense. Since over 90% of strains in a microbial
community cannot be isolated or cultivated, metagenomic methods have been used to analyze a microbial
community as a whole. Such an approach has enabled the exploration of relationships among microbes,
their communities and habitats at the most fundamental genomic level. Furthermore, environments
profoundly and delicately shape microbial community structures, so that communities from different
conditions or time-points differ, and even communities from similar types of environment can be
significantly different.

With the advancement of microbial community analysis, it is now possible to conduct sample collection,
DNA extraction and taxonomical structure analysis through an efficient pipeline for large numbers of
samples. These efforts, together with advanced methods for rapid sample comparison, have enabled the
monitoring of microbial communities over time and under different conditions. For example, microbial
community analyses have been conducted to monitor human microbial communities, ocean microbial
communities and soil microbial communities.

As large-scale metagenomic analyses become a clear trend in microbial community analysis, data-mining
methods should keep pace. Given the large volume of microbial community samples, it is becoming more
and more important to perform in-depth data-mining for valuable biological information on a large
scale. Currently, many tools such as Mothur, QIIME and MEGAN provide metagenomic analysis methods for
microbial communities, but they mostly focus on the samples alone and ignore the connections to
environmental factors. Some of these tools also face difficulties in throughput and data volume when
hundreds of samples are to be compared and integrated for mining. The basic data-mining requirements
are to unveil the correlations between

Figure 1 | The data model for comparison of a number of microbial community samples. (A) The 3-aspect view for the comparison data model. (B) 
Meta-data could be extended to include multiple environmental and temporal variables including habitat, pH value, etc. Among these meta-data 
variables, some are highly related to human habitat samples, while others are highly related to environmental samples. 



communities and key factors (taxa, environmental factors, etc.), as well as the effect of these
factors on the changes of these communities. We believe an advanced data-mining method should have at
least two properties: first, it should be capable of handling large-scale datasets, and second, its
analysis results should be profound enough to show the underlying relationships among microbial
community structures, their environments, and the ever-changing organisms within samples.

Though microbial community data come from different sources and have different structures, a
large-scale comparison of them could be presented based on a unified data model, namely the
"Multi-Dimensional View" (MDV) data model, which should include at least 3 aspects (Figure 1; for
details refer to the "Methods" section): samples profile (S), taxa profile (T) and meta-data profile
(V), the last covering environmental conditions such as sampling time and condition. In other words,
MDV = {S, T, V}. Among these, the "meta-data" profile includes all environmental and temporal
variables for microbial communities, such as host/habitat for human microbiota, temperature, pH
value, etc. This 3-aspect view (Figure 1 (A)) is a simplified model that could include more views,
such as different batches of experiments, to become the extended MDV model (Figure 1 (B)).
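
As an illustration of MDV = {S, T, V}, the three aspects can be held in one small container. The sketch below is only one possible layout, not the authors' implementation: the class name `MDV`, its fields, its helper methods and all toy abundance values are assumptions made for this example, and the phylogenetic relationship among taxa is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MDV:
    """Hypothetical container for the three MDV aspects (illustration only)."""
    samples: List[str]                     # S: sample IDs
    taxa: Dict[str, Dict[str, float]]      # T: sample ID -> {taxon: relative abundance}
    meta: Dict[str, Dict[str, object]]     # V: sample ID -> {variable: value}

    def variables(self) -> List[str]:
        """Names of all meta-data variables present in V, sorted."""
        return sorted({name for row in self.meta.values() for name in row})

    def group_by(self, variable: str) -> Dict[object, List[str]]:
        """Group sample IDs by the value of one meta-data variable."""
        groups: Dict[object, List[str]] = {}
        for sample in self.samples:
            groups.setdefault(self.meta[sample][variable], []).append(sample)
        return groups


# Toy data: the abundance values are made up for the example.
mdv = MDV(
    samples=["s1", "s2", "s3"],
    taxa={
        "s1": {"Bacteroidaceae": 0.6, "Clostridiaceae": 0.4},
        "s2": {"Bacteroidaceae": 0.5, "Clostridiaceae": 0.5},
        "s3": {"Prevotellaceae": 0.7, "Pasteurellaceae": 0.3},
    },
    meta={
        "s1": {"Host": "Female 1", "Habitat": "gut"},
        "s2": {"Host": "Male 2", "Habitat": "gut"},
        "s3": {"Host": "Female 1", "Habitat": "oral"},
    },
)
```

Adding a further view (for example an experiment batch) then amounts to adding one more key to each meta-data row, which mirrors the extended model of Figure 1 (B).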

Based on this MDV model, the mining of biological relationships from communities could be summarized
as data-mining in the MDV = {S, T, V} space, and the two above-mentioned properties for data-mining
method development become natural and clear: deep data-mining essentially corresponds to the
effective clustering of the basic elements in the MDV model, and the efficiency requirement
corresponds to the need for fast processing of such clustering. Thus the




Figure 2 | The 3 microbial community datasets used in this study, represented in 3D views according to the MDV data model. Each dataset corresponds
to an MDV model with a different {S, T, V} space. The MDV cubes were generated using SVG (Scalable Vector Graphics) and photos were captured by one of
the authors (Xiaoquan Su) in-house.
Table 1 Information of the human-associated habitat samples
Host (v1) Habitat (v2) Number of samples (|S|)
effective and efficient clustering of basic elements in the MDV model would be the core for the
success of large-scale microbial community data-mining.

In this work, we focused on inferring the correlation between the taxa profile (T) and meta-data (V)
by data-mining in the MDV model, i.e., comparison of samples with different meta-data. We propose a
method for rapid data comparison and correlation analysis among microbial community samples based on
the MDV model, supported by High-Performance Computation for rapid processing. This method has been
applied to 3 sets of samples from different conditions, namely human-associated habitats, soil and
marine water, each of which has a large number of samples. These datasets are of different complexity
and come with different meta-data, so they are suitable for assessing the data model and the data
analysis methods. The comparison and correlation analysis results on these datasets showed excellent
performance of our method for in-depth data-mining from massive numbers of microbial community
samples.

Results 

Microbial community samples. We evaluated the efficiency of the sample comparison and correlation
analysis method in MDV spaces based on 3 microbial community datasets. The 3 sets of microbial
community samples were gathered from different environments, each having a large number of samples
(Figure 2). Dataset A contains 258 human-associated microbial community samples from 3 different
habitats of 6 individuals, which were produced by Caporaso, et al., PNAS 2011 and Caporaso, et al.,
Genome Biology 2011 (refer to Table S1 in Supporting Information File S1 for details); Dataset B
contains 40 microbial samples from marine surface water sampled at 3 different time-points, which
were produced by Caporaso, et al., PNAS 2011 (refer to Table S2 in Supporting Information File S1 for
details); Dataset C contains 42 soil microbial community samples from 3 different locations, produced
by the same work as Dataset B (refer to Table S3 in Supporting Information File S1 for details).
These 3 datasets thus represent broad-based microbial communities that also have important
biological applications. The sequencing data of all these samples were produced from 16S rRNA genes
on the Illumina GAIIx platform.

Results on human-associated habitat microbial community samples. 

The commensal microorganisms living in our gut, skin and various other places have key roles in our
physiology, including our immune responses and metabolism, as well as in various human
Figure 3 | Similarity matrix of human-associated habitat microbial community samples. (A) Hosts were
from different families. (B) Hosts were from the same family. Each tile represents the similarity
value between two samples on a color gradient between red and green: red indicates higher similarity
and green indicates lower similarity, with red/green shades in between indicating intermediate values.

diseases. Since hosts and sampling times significantly affect the structure of human-associated
habitat microbial communities, the combination of a large number of samples together with their
meta-data serves as a good benchmark for testing analysis methods.

In this case study, we obtained 258 human-associated habitat microbial community samples from 3
different habitats (gut samples from feces, skin samples from palms and oral samples from tongue) of 6
individuals (Table 1). In the MDV model, |S| = 258 and V = {Host, Habitat}. Among the 6 hosts, 2
(Female 5 and Male 6) were from the same family, with samples obtained from Caporaso, et al., Genome
Biology 2011, while the others (Female 1, Male 2, Male 3 and Male 4) were from different families,
with samples' sequences produced by different primers and obtained from Caporaso, et al., PNAS 2011.



Table 2 Prominent taxa which could distinguish samples from different habitats
Taxon Habitat P-value
Bacteroidaceae
Figure 4 | PCoA analysis results for samples from the same family. Samples were categorized by habitats on the left, and by hosts on the right.



We first generated pair-wise similarity matrices for all 258 samples based on their taxonomical
structures ((S, T) space of the MDV model), for hosts from different families (Figure 3 (A)) and from
the same family (Figure 3 (B)), respectively. We then applied hierarchical-based clustering to the
similarity matrices to examine the relationships among different human microbiota (for details refer
to the "Methods" section). Results (Figure 3) show that samples from the same habitat clustered
together, and that samples from skin and oral environments shared more common structures, whereas
the community structures of gut samples were significantly different.

Figure 5 | Clustering and bio-marker analysis results of marine samples. (A) Hierarchical-based clustering results showing the relationships
among samples, in which a darker red color indicates that two samples are more similar. (B) Density-based clustering result to examine the major
differentiation factors, in which nodes represent samples, and an edge between two nodes indicates that their similarity is above the threshold of 85%.
(C) The relative abundance distribution across all marine water samples for the 3 most dynamic taxa.
Table 3 Information of soil samples
Type (v1) Location (v2) pH (v3) Number of samples (|S|)
Desert scrub soil
Pine soil
This clustering pattern by habitat indicated that among the various meta-data (V space of the MDV
model, including family background (possibly related to diet), host and habitat), habitat played the
more important role in shaping the community structures of these samples. Further probing of the
bio-marker taxa in the (T, V) space of the MDV model (for details refer to the "Methods" section) that
caused this pattern showed that Bacteroidaceae and Clostridiaceae (dominating gut microbial
communities), Prevotellaceae and Pasteurellaceae (dominating oral microbial communities), and
Corynebacterineae (dominating skin microbial communities) were the most prominent taxa (Table 2) that
could distinguish samples from different habitats.

We noticed that among the hosts from different families, most samples from the same host clustered
together for each habitat (Figure 3 (A)). Only a few samples labeled "Male_4_Gut" were divided into
two groups, probably because the sequences produced by different primers came from different regions
of the 16S rRNA gene. Additionally, among family members (Female 5 and Male 6), samples of the same
habitat could not be distinguished by host (Figure 3 (B)). The most abundant taxa in samples from
Female 5 and Male 6 include Bacteroidaceae (P-value = 0.346), Prevotellaceae (P-value = 0.777),
Pasteurellaceae (P-value = 0.809) and Streptococcus (P-value = 0.741), which showed high similarity
in relative abundances due to the strong effect of the shared small-scale family environment, thus
making the differentiation difficult.

Furthermore, we conducted Principal Coordinates Analysis (PCoA) based on the similarity matrix of
samples from the same family to examine the correlation of the microbial community patterns with
hosts and habitats. The PCoA results (Figure 4) clearly show that samples could be differentiated by
habitat, but samples from the same habitat and different family members were mixed together because
they shared similar community structure patterns.

Results on microbial community samples from marine water. 

Marine microbial communities play a very important role in the regulation of global carbon and
nitrogen cycling, and they contain important genes for a wide range of applications such as
bioenergy, bioremediation, etc. However, marine samples are very diverse in their structure as well
as function, making knowledge discovery from them quite challenging.

In this work, we applied our method to analyze 40 microbial samples produced by Caporaso, et al.,
PNAS 2011 from marine surface water of Newport Beach Pier, CA, US. These samples were collected at 3
different time-points (seasons) at the same location. In the MDV model, |S| = 40 and
V = {Time, Temperature}. Based on the pair-wise similarity matrix, we used the hierarchical-based
method to evaluate the relationships among all marine water communities and the density-based
clustering method MCODE (for details refer to the "Methods" section) to examine the major
differentiation factors over the time course.

Results in Figure 5 (A) and Figure 5 (B) indicated that all samples could be divided into three groups
by the meta-data of sampling time-point (V space in the MDV model). Since these marine water samples
were collected from the same near-coast site and water depth (surface) yet at 3 different time-points
(seasons) with different water temperatures, the microbial community structures showed high
correlation with V = water temperature in the MDV model in Figure 5 (B), which has also been reported
in other works. Detailed analyses of bio-markers in the (T, V) space of the MDV model showed that the
relatively abundant and most dynamic taxa for these samples include Flavobacteriaceae
(P-value = 0.00095), Prochlorococcus (P-value = 0.00056), and Rhodobacteraceae (P-value = 0.00056)
(Figure 5 (C)), all of which were sensitive to water temperature as well. Additionally, from
Figure 5 (A) we observed that though each cluster of samples had high intra-cluster similarity,
samples from time-point 2 were not similar to any of the samples from time-point 1 and time-point 3,
indicating that the meta-data for samples from time-point 2 might be drastically different. Our
analyses of the above 3 most dynamic taxa have also shown
Figure 6 | Clustering analysis results of soil samples based on hierarchical-based clustering.

Figure 7 | Correlation analysis result based on soil samples. (A) PCoA analysis results of soil samples. (B) Correlation of taxa abundances with Vi = pH
values. R is the Pearson correlation coefficient of the pH value against the relative abundance across all soil samples.



that compared to samples from time-point 1 and time-point 3, samples from time-point 2 always have
different taxa abundances with regard to Flavobacteriaceae, Prochlorococcus and Rhodobacteraceae
(Figure 5 (C)).

Results on microbial community samples from soil. Soil microbial communities are among the most
important communities on land for the regulation of carbon and nitrogen cycling on earth, and they
are directly related to agricultural research. Soil microbial communities also represent the most
complex, diverse and dynamic communities on earth.



We used 42 soil microbial community samples from 3 different places, each with a different pH value,
from the work of Caporaso, et al., PNAS 2011 to demonstrate the performance of our method. For the
soil samples, both 3' reads and 5' reads, which were the sequencing results of 16S rRNA genes in two
complementary directions by different primers, were generated and analyzed together (Table 3). In the
MDV model, |S| = 42 and V = {Type, Location, pH, Primer}. We then processed the samples with the
hierarchical-based clustering method (for details refer to the "Methods" section) based on their
similarity matrix to discover the corresponding environmental patterns.
Figure 8 | Running time for the whole data-mining procedures. Bar chart illustrated the running time comparison between CPU and GPU
(Tesla M2075) computing. The Y-axis was in 10-based log scale. Pie charts showed the proportions of each processing step.

Figure 9 | The overall scheme for microbial community data-mining.

From the results (Figure 6) we observed that all samples could be divided into 3 groups, mainly by
the pH values of the sampling environments. We also noticed that within each group, samples sequenced
by the 3' primer and the 5' primer could be distinguished in the clustering results, because
sequences produced by the 3' primer and the 5' primer came from different regions of the 16S rRNA
genes. We also verified our results using the Fast UniFrac algorithm and obtained similar results
(refer to Figure S1 in Supporting Information File S1 for details).

We further investigated the correlation between the community structures of soil samples and their
environmental factors by PCoA (Principal Coordinates Analysis) in the (T, V) space of the MDV model.
Results in Figure 7 (A) showed the high correlation of the community structure with the pH values:
both 3' reads samples and 5' reads samples were ordered from alkaline soil to acidic soil (from
pH 8.3 to pH 4.9), and samples from the acidic and semi-acidic environments (pH 4.9 soil and pH 6.1
soil) were more similar to each other, which is consistent with Fierer et al., PNAS 2006.

Then we performed the bio-marker analysis to discover the abundant key taxa that strongly correlated
with Vi = pH value. As soil microbial communities are much more complex, with a huge number
(> 1,000) of species in each sample, a taxon with more than 5% relative proportion in the community
is already very abundant. The abundances of Sphingomonadaceae (Pearson correlation coefficient
R = 0.9537, abundances 0.6%-15%), Rubrobacterineae (R = 0.9696, abundances 0.9%-5.5%) and
Micromonosporineae (R = 0.9296, abundances 0.5%-5.3%) had strong positive correlations with pH
values, while that of Burkholderiaceae (R = -0.9832, abundances 0.3%-3.4%) was highly negatively
correlated with pH values, which would be the reason behind the strong correlation of community
structure with pH values (Figure 7 (B)). In addition, there was no significant correlation
(|R| < 0.7) between pH values and other abundant taxa. This further supported the view that pH values
might affect soil microbial communities significantly through changes in these abundant taxa.

Efficiency analysis. We also evaluated the running time of the data-mining analysis, including
similarity matrix construction, clustering and correlation analysis, on the 3 sets of microbial
communities. Thanks to GPU-based High-Performance Computing (HPC) in the most time-consuming step,
similarity matrix construction (Figure 8, pie charts), the overall GPU computation achieved more than
60 times speed-up over CPU computation with 16 cores (Figure 8, bar charts). This HPC strategy allowed
data-mining on 258 samples (Dataset A) to be completed within only 2 minutes, of which nearly 30% was
spent on clustering and correlation analyses.

Discussion and Conclusion 

As large amounts of metagenomic data can be accumulated quickly from various microbial community
profiling projects using NGS, it is becoming more and more important to perform in-depth analysis of
microbial communities, as well as data-mining for valuable yet hidden biological principles that
control the dynamic changes of microbial community samples. The basic questions for such a large
number of samples are comparison and correlation analysis, which include understanding the
relationships among communities, the key factors (taxa, environmental factors, etc.) behind such
relationships, and the effect of environmental and/or temporal factors on community dynamics.

One apparent yet critical problem for data-mining from large numbers of microbial communities is the
heterogeneity of samples (different sources, different meta-data, different structures, etc.). In
this work, we have proposed a data model to represent large-scale comparison of these samples, namely
the "Multi-Dimensional View" data model (MDV = {S, T, V}), which consists of 3 basic aspects: sample
profile, taxa profile and meta-data profile. Effective and efficient analysis among the different
elements of the MDV model is the core for the success of large-scale microbial community data-mining.
We have also proposed a method for rapid data comparison and correlation analysis among microbial
community samples based on the MDV model, supported by High-Performance Computation for rapid
processing. The comparison and correlation analysis results on datasets from various sampling
conditions showed excellent performance for in-depth data-mining from massive numbers of microbial
community samples.

The MDV model is not restricted to sample clustering; it could also be used for taxa clustering.
Based on taxa clustering (in T space), important biomarkers for distinguishing samples could be
discovered. Clustering from the angle of meta-data (in V space) would also help to identify important
environmental or temporal factors that affect the dynamics of microbial community samples. Such
future work based on the MDV model would support further data-mining and in-depth understanding of
the underlying principles controlling the functions and evolution of various microbial communities,
which would also have great potential in applications.

Methods 

The MDV data model. The "Multi-Dimensional View" (MDV) data model includes 
3 aspects (Figure 1): sample profile (S), taxa profile (T) and meta-data profile (V), 
which could be integrated by formula 1: 



$$
\begin{cases}
S = (s_1, s_2, \ldots, s_n) \\
T = f_{phylogeny}(t_1, t_2, \ldots, t_m) \\
V = (v_1, v_2, \ldots, v_q)
\end{cases}
\qquad (1)
$$



In this 3-dimensional view (3D view), the sample profile S = (s_1, s_2, ..., s_n) contains the ID
and basic information of the samples; the taxa profile T = (t_1, t_2, ..., t_m) contains
community structure information about the taxa, their relative abundances in
different samples and their phylogenetic relationships (represented by f_phylogeny in
Formula 1); the meta-data profile V = (v_1, v_2, ..., v_q) contains the meta-data (sampling
time, environmental condition, etc.) of all samples. In this work, we focus on analysing
the relationships among samples with different meta-data. This is equivalent to
inferring the correlation between the taxa profile (T) and the meta-data (V) by
data-mining in the MDV model, which could also be described by Formula 2:
$$
(S, T, V) \xrightarrow{\text{Data-mining}} f_{correlation}(T, V) \qquad (2)
$$

Figure 10 | The GPU-based High-Performance Computation strategy in the MDV model. Panel labels: "The MDV Model"; "Sample comparison and similarity matrix construction".


The data-mining method. The rapid data-mining procedure includes community
structure analysis, similarity matrix construction, sample clustering and correlation
with meta-data based on the MDV model. The overall scheme is illustrated in
Figure 9.

Microbial community structure analysis. The community structure profiles of all
samples are parsed from their 16S rRNA gene sequences by the highly efficient
metagenomic analysis tool Parallel-META (version 2.0). Parallel-META maps the
16S rRNA sequences of each sample to the reference database by MegaBLAST to
identify the taxonomical classification and phylogenetic relationship of each species.
In this work we use the GreenGenes core-set (release date: May 2009) as the
reference database and 1E-30 as the expectation value for MegaBLAST-based database
mapping.

Similarity matrix construction. The similarity matrix reflects the similarity of samples
in S = (s_1, s_2, ..., s_n) space based on their taxonomical structure data
T = f_phylogeny(t_1, t_2, ..., t_m). The similarity score between two microbial community
samples is a quantitative value (always a float between 0% and 100%) calculated by the
Meta-Storms algorithm from the community structure analysis results. The similarity
matrix of N samples consists of N*N entries, each of which is the similarity score of one
sample pair. Based on the permutation test results in our previous work, a similarity
score of 85% or higher indicates significant similarity between 2 samples.
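
The construction of the pair-wise matrix and the 85% cut-off can be sketched as follows. This is a minimal illustration under a strong assumption: the function `overlap_similarity` is a simple stand-in (the shared relative abundance of two profiles) and is not the Meta-Storms score, which also uses the phylogenetic relationship of the taxa. All function names and toy profiles are hypothetical.

```python
from typing import Dict, List, Tuple

Profile = Dict[str, float]  # taxon -> relative abundance, summing to 1.0


def overlap_similarity(a: Profile, b: Profile) -> float:
    """Stand-in score in percent: shared relative abundance of two profiles.

    NOT the Meta-Storms score; the phylogenetic tree is ignored here.
    """
    shared = sum(min(a[t], b[t]) for t in a.keys() & b.keys())
    return 100.0 * shared


def similarity_matrix(profiles: List[Profile]) -> List[List[float]]:
    """Full N x N matrix of pair-wise scores (symmetric)."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = overlap_similarity(profiles[i], profiles[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


def significant_pairs(matrix: List[List[float]],
                      threshold: float = 85.0) -> List[Tuple[int, int]]:
    """Index pairs (i < j) whose score reaches the significance threshold."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] >= threshold]


# Toy profiles with made-up taxa and abundances.
profiles = [
    {"A": 0.5, "B": 0.5},
    {"A": 0.4, "B": 0.6},
    {"C": 1.0},
]
matrix = similarity_matrix(profiles)
pairs = significant_pairs(matrix)
```

Only the first two toy profiles share enough abundance to pass the 85% threshold; the third shares no taxon with the others and scores 0% against both.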

Clustering methods. Clustering methods include the hierarchical-based method and the
density-based method in MDV = {S, T, V} space. The hierarchical-based clustering
elucidates the relationships among the microbial community samples and
sample groups, while the density-based clustering focuses on discovering sample
groups with significant differences defined by a given threshold. The density-based
clustering is also used as a validity check on the results of the hierarchical-based
clustering.

(a) The hierarchical-based clustering method is implemented by the "hclust" function
of CRAN R, and results are visualized by the MetaSee software and the
"gplots" package (Gregory R., et al, gplots: Various R programming tools
for plotting data. http://CRAN.R-project.org/package=gplots) of CRAN R.
In the hierarchical-based clustering, distances among different clusters are
evaluated using the "average linkage" method
(http://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html).

(b) The density-based clustering method is implemented by MCODE and
results are visualized in the Cytoscape software. Based on permutation tests,
a similarity score of 85% or higher indicates significant similarity between 2
samples. In the density-based clustering analysis we select 85% as the
threshold for significant difference.
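
The two clustering views can be illustrated on a small similarity matrix. The sketch below is written in plain Python rather than R and makes two assumptions: `average_linkage` is a naive re-implementation of average-linkage agglomeration (merging by highest mean similarity, which is equivalent to lowest mean distance), and `threshold_components` replaces MCODE with a much simpler rule, namely connected components of the graph whose edges join samples with similarity of at least 85%. MCODE itself searches for densely connected regions and would generally give finer groups. Function names and the toy matrix are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # symmetric similarity matrix, values in percent


def average_linkage(sim: Matrix, k: int) -> List[List[int]]:
    """Agglomerative clustering with average linkage, stopped at k clusters."""
    clusters = [[i] for i in range(len(sim))]

    def mean_similarity(a: List[int], b: List[int]) -> float:
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the two clusters with the highest average similarity.
        x, y = max(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: mean_similarity(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return sorted(clusters)


def threshold_components(sim: Matrix, threshold: float = 85.0) -> List[List[int]]:
    """Connected components of the graph with edges where similarity >= threshold."""
    n = len(sim)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for other in range(n):
                if other not in seen and sim[node][other] >= threshold:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components


# Toy matrix with made-up scores: samples 0-1 and 2-3 form two tight groups.
toy = [
    [100.0, 92.0, 40.0, 35.0],
    [92.0, 100.0, 38.0, 42.0],
    [40.0, 38.0, 100.0, 88.0],
    [35.0, 42.0, 88.0, 100.0],
]
hier = average_linkage(toy, k=2)
dense = threshold_components(toy)
```

On this toy input both views recover the same two groups, which is the kind of agreement the density-based result is used to check.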

Correlation and bio-marker selection methods. The correlation analysis attempts to
discover relationships between the taxa profile T = f_phylogeny(t_1, t_2, ..., t_m) space and
the V = (v_1, v_2, ..., v_q) space based on the clustering results to deduce the
f_correlation(T, V) in Formula 2. Principal Coordinates Analysis (PCoA) is used to elucidate the
correlation between community structures and meta-data based on the similarity
matrix, and is implemented by the "vegan" package (Jari Oksanen, et al, vegan:
Community Ecology Package. http://CRAN.R-project.org/package=vegan) of
CRAN R. We then select the bio-markers, which are abundant taxa
that have high correlation with the meta-data and clustering results. For
numerical meta-data (such as pH value, temperature, etc.), we calculate the Pearson
correlation coefficient (R) between the abundance values of specified taxa and the meta-data,
and select the taxa with an R value equal to or larger than 0.9, which indicates
significant correlation between abundance values and meta-data. For discrete
meta-data (such as human-associated habitat, location, etc.), we perform the
Wilcoxon and Kruskal-Wallis rank-sum tests and select the taxa with a P-value smaller than
or equal to 0.01, which indicates a significant difference of abundance values among different
meta-data.
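
For numerical meta-data, the selection step can be sketched as below. The Pearson coefficient is computed directly from its definition. One assumption is made explicit: the cut-off is applied to the absolute value |R|, so that strongly negative correlations (such as the one reported for Burkholderiaceae) are also retained; the text above states the cut-off only as "R value equal to or larger than 0.9". The pH values are those of the soil dataset, while the taxon names and abundance values are made up for the example.

```python
from math import sqrt
from typing import Dict, List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_biomarkers(abundance: Dict[str, List[float]],
                      values: List[float],
                      cutoff: float = 0.9) -> Dict[str, float]:
    """Keep taxa whose abundance correlates with the meta-data, |R| >= cutoff.

    Using |R| instead of R is an assumption made for this sketch.
    """
    selected = {}
    for taxon, series in abundance.items():
        r = pearson_r(series, values)
        if abs(r) >= cutoff:
            selected[taxon] = r
    return selected


ph = [4.9, 6.1, 8.3]  # pH of the three soil types
abundance = {          # hypothetical relative abundances (%), one per soil type
    "taxon_up": [1.0, 2.0, 4.0],
    "taxon_down": [4.0, 3.0, 1.0],
    "taxon_flat": [2.0, 1.0, 2.0],
}
selected = select_biomarkers(abundance, ph)
```

With these toy values the rising and the falling taxon are both selected, and the taxon with no clear trend is dropped.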

High-performance computing. The MDV data model has been designed with
parallel processing of sample comparison in mind. The similarity among microbial
community samples is evaluated by the similarity scores in (T, V) space of the MDV
model. The similarity score between each sample pair is calculated by the Meta-Storms
algorithm with a time complexity of N*log(N) (N is the number of species existing in one
sample). However, as the number of samples increases, the overall time complexity of
M^2 * N*log(N) (M is the number of samples) for pair-wise comparison
leads to an unacceptable running time.

In this work, we performed the calculation of the similarity matrix for massive
numbers of samples using GPU-Meta-Storms on NVIDIA Tesla M2075 GPU
hardware (448 stream processors, 6 GB onboard memory). To calculate the similarity
matrix of N samples, N * N threads are launched on the many-core GPU architecture
so that each similarity score in the matrix is processed by one independent thread in
parallel (Figure 10). To fully utilize the GPU computation power, we also
designed optimization strategies including global memory alignment, register
recalling allocation and shared memory utilization in I/O (Input/Output) operations to
improve the overall performance of GPU computing.
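
The one-thread-per-cell layout can be emulated sequentially to show why the cells are independent. The sketch below is not the GPU-Meta-Storms kernel: the row-major mapping from a flat thread index to a matrix cell is an assumption, since the actual indexing scheme is not given here, and the loop stands in for threads that would run concurrently on the GPU.

```python
from typing import Callable, List, Tuple


def thread_to_pair(thread_id: int, n: int) -> Tuple[int, int]:
    """Map a flat thread index in [0, n*n) to one (row, column) matrix cell."""
    return divmod(thread_id, n)


def fill_matrix(n: int, score: Callable[[int, int], float]) -> List[List[float]]:
    """Sequentially emulate n*n independent threads, one per matrix cell."""
    matrix = [[0.0] * n for _ in range(n)]
    for thread_id in range(n * n):
        i, j = thread_to_pair(thread_id, n)
        matrix[i][j] = score(i, j)  # no cell reads or writes any other cell
    return matrix


# Number of threads needed for the 258 samples of Dataset A.
threads_for_dataset_a = 258 * 258

# Placeholder score function, used only to exercise the mapping.
demo = fill_matrix(2, lambda i, j: float(i + j))
```

For Dataset A this layout corresponds to 66,564 threads, each computing one similarity score without waiting on any other.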